Abstract

In this report we demonstrate the potential utility of re- 
source allocation management systems that use virtual 
machine technology for sharing parallel computing re- 
sources among competing jobs. We formalize the resource 
allocation problem with a number of underlying assump- 
tions, determine its complexity, propose several heuris- 
tic algorithms to find near-optimal solutions, and evalu- 
ate these algorithms in simulation. We find that among 
our algorithms one is very efficient and also leads to the 
best resource allocations. We then describe how our ap- 
proach can be made more general by removing several 
of the underlying assumptions. 



1. Introduction 

The use of commodity clusters has become main- 
stream for high-performance computing applications, 
with more than 80% of today's fastest supercomputers being
clusters [ ]. Large-scale data processing [ , 33, 39]
and service hosting [4,22] are also common applications. 
These clusters represent significant equipment and infras- 
tructure investment, and having a high rate of utilization 
is key for justifying their ongoing costs (hardware, power, 
cooling, staff) [ ]. There is therefore a strong incentive
to share these clusters among a large number of applications
and users.

The sharing of compute resources among competing 
instances of applications, or jobs, within a single sys- 
tem has been supported by operating systems for decades 
via time-sharing. Time-sharing is implemented with rapid
context switching and is motivated by a need for interactivity.
A fundamental assumption is that there is no or
little a-priori knowledge regarding the expected work- 
load, including expected durations of running processes. 
This is very different from the current way in which clus- 
ters are shared. Typically, users request some fraction 
of a cluster for a specified duration. In the traditional 
high-performance computing arena, the ubiquitous ap- 
proach is to use "batch scheduling", by which jobs are 
placed in queues waiting to gain exclusive access to a sub- 
set of the platform for a bounded amount of time. In ser- 
vice hosting or cloud environments, the approach is to al- 
low users to lease "virtual slices" of physical resources, 
enabled by virtual machine technology. The latter approach
has several advantages, including OS customization
and interactive execution. In general, resource sharing
among competing jobs is difficult because jobs have
different resource requirements (amount of resources, 
time needed) and because the system cannot accommo- 
date all jobs at once. 

An important observation is that both resource alloca- 
tion approaches mentioned above dole out integral sub- 
sets of the resources, or allocations (e.g., 10 physical 
nodes, 20 virtual slices), to jobs. Furthermore, in the 
case of batch scheduling, these subsets cannot change 
throughout application execution. This is a problem be- 
cause most applications do not use all resources allo- 
cated to them at all times. It would then be useful to 
be able to decrease and increase application allocations 
on-the-fly (e.g., by removing and adding more physical 
cluster nodes or virtual slices during execution). Such
applications are termed "malleable" in the literature. While
solutions have been studied to implement and to schedule
malleable applications [12, 25, 46, 51, 52], it is often
difficult to make sensible malleability decisions at the
application level. Furthermore, many applications are used
as-is, with no desire or possibility to re-engineer them
to be malleable. As a result sensible and automated mal- 
leability is rare in real-world applications. This is perhaps 
also due to the fact that production batch scheduling envi- 
ronments do not provide mechanisms for dynamically in- 
creasing or decreasing allocations. By contrast, in service 
hosting or cloud environments, acquiring and relinquish- 
ing virtual slices is straightforward and can be imple- 
mented via simple mechanisms. This provides added mo- 
tivation to engineer applications to be malleable in those 
environments. 

Regardless, an application that uses only 80% of a 
cluster node or of a virtual slice would need to relinquish
only 20% of this resource. However, current resource
allocation schemes allocate integral numbers of
resources (whether these are physical cluster nodes or vir- 
tual slices). Consequently, many applications are denied 
access to resources, or delayed, in spite of cluster re- 
sources not being fully utilized by the applications that 
are currently executing, which hinders both application 
throughput and cluster utilization. 

The second limitation of current resource allocation 
schemes stems from the fact that resource allocation with 
integral allocations is difficult from a theoretical perspec- 
tive [10]. Resource allocation problems are defined for- 
mally as the optimizations of well-defined objective func- 
tions. Due to the difficulty (i.e., NP-hardness) of resource 
allocation for optimizing an objective function, in the 
real-world no such objective function is optimized. For 
instance, batch schedulers instead provide a myriad of 
configuration parameters by which a cluster administra- 
tor can tune the scheduling behavior according to ad-hoc 
rules of thumb. As a result, it has been noted that there is 
a sharp disconnect between the desires of users (low ap- 
plication turn-around time, fairness) and the schedules 
computed by batch schedulers [ , 44]. It turns out that
cluster administrators often attempt to maximize cluster
utilization. But recall that, paradoxically, current resource
allocation schemes inherently hinder cluster utilization!

A notable finding in the theoretical literature is that 
with job preemption and/or migration there is more flexi- 
bility for resource allocation. In this case certain resource 
allocation problems become (more) tractable or approximable
[ , 11, 27, 33, 44]. Unfortunately, preemption and
migration are rarely used on production parallel plat- 
forms. The gang scheduling [ ] approach allows en- 
tire parallel jobs to be context-switched in a synchronous 
fashion. Unfortunately, a known problem with this ap- 
proach is the overhead of coordinated context switching 
on a parallel platform. Another problem is the memory 
pressure due to the fact that cluster applications often use 
large amounts of memory, thus leading to costly swapping
between memory and disk [ ]. Therefore, while flexibility
in resource allocations is desirable for solving resource
allocation problems, affording this flexibility has
not been successfully accomplished in production sys- 
tems. 

In this paper we argue that both limitations of current 
resource allocation schemes, namely, reduced utilization 
and lack of an objective function, can be addressed simul- 
taneously via fractional and dynamic resource allocations 
enabled by state-of-the-art virtual machine (VM) technol- 
ogy. Indeed, applications running in VM instances can be 
monitored so as to discover their resource needs, and their 
resource allocations can be modified dynamically (by ap- 
propriately throttling resource consumption and/or by 
migrating VM instances). Furthermore, recent VM tech- 
nology advances make the above possible with low over- 
head. Therefore, it is possible to use this technology for 
resource allocation based on the optimization of sensi- 
ble objective functions, e.g., ones that capture notions of 
performance and fairness. 

Our contributions are: 

• We formalize a general resource allocation prob- 
lem based on a number of assumptions regarding 
the platform, the workload, and the underlying VM 
technology; 

• We establish the complexity of the problem and 
propose algorithms to solve it; 

• We evaluate our proposed algorithms in simula- 
tion and identify an algorithm that is very efficient 
and leads to better resource allocations than its com- 
petitors; 

• We validate our assumptions regarding the capabil- 
ities of VM technology; 

• We discuss how some of our other assumptions can 
be removed and our approach adapted to parallel 
jobs and dynamic jobs. 

This paper is organized as follows. In Section 2 we 
define and formalize our target problem, we list our as- 
sumptions for the base problem, and we establish its NP- 
hardness. In Section 3 we propose algorithms for solving 
the base problem and evaluate these algorithms in simu- 
lation in Section 4. Sections 5 and 6 study the resource 
sharing problem with relaxed assumptions regarding the 
nature of the workload, thereby handling parallel and dy- 
namic workloads. In Section 7 we validate our fundamen- 
tal assumption that VM technology allows for precise 
resource sharing. Section 8 discusses related work. Sec- 
tion 9 discusses future directions. Finally, Section 10 con- 
cludes the paper with a summary of our findings. 
Figure 1. System architecture with 12 homogeneous physical hosts and 3 running virtual clusters.



2. Flexible Resource Allocation 

2.1. Overview 

In this work we consider a homogeneous cluster plat- 
form, which is managed by a resource allocation system. 
The architecture of this system is depicted in Figure 1.
Users submit job requests, and the system responds by 
creating sets of VM instances, or "virtual clusters" (VC) 
to run the jobs. These instances run on physical hosts that 
are each under the control of a VM monitor [8, 34, 5. ]. 
The VM monitor can enforce specific resource consump- 
tion rates for different VMs running on the host. All 
VM monitors are under the control of a VM manage- 
ment system that can specify resource consumption rates 
for VM instances running on the physical cluster. Fur- 
thermore, the VM resource management system can en- 
act VM instance migrations among physical hosts. An 
example of such a system is the Usher project [ ]. Finally,
a Resource Allocator (RA) makes decisions regarding
whether a request for a VC should be rejected or
admitted, regarding possible VM migrations, and regard- 
ing resource consumption rates for each VM instance. 

Our overall goal is to design algorithms implemented 
as part of the RA that make all virtual clusters "play
nice" by allowing fine-grain tuning of their resource
consumptions. The use of VM technology is key for increasing
cluster utilization, as it makes it possible to allocate
to VCs only the resources they need when they need
them. The mechanisms for allowing on-the-fly modifica- 
tion of resource allocations are implemented as part of 
the VM Monitors and the VM Management System. 



A difficult question is how to define precisely what 
"playing nice" means, as it should encompass both no- 
tions of individual job performance and notions of fair- 
ness among jobs. We address this issue by defining a per- 
formance metric that encompasses both these notions and 
that can be used to value resource allocations. The RA 
may be configured with the goal of optimizing this met- 
ric but at the same time ensuring that the metric across the 
jobs is above some threshold (for instance by rejecting re- 
quests for new virtual clusters). More generally, a key 
aspect of our approach is that it can be combined with re- 
source management and accounting techniques. For in- 
stance, it is straightforward to add notions of user prior- 
ities, of resource allocation quotas, of resource alloca- 
tion guarantees, or of coordinated resource allocations 
to VMs belonging to the same VC. Furthermore, the RA 
can reject or delay VC requests if the performance met- 
ric is below some acceptable level, to be defined by clus- 
ter administrators. 

2.2. Assumptions 

We first consider the resource sharing problem using 
the following five assumptions regarding the workload,
the physical platform, and the VM technology in use: 

(H1) Jobs are CPU-bound and require a given amount
of memory to be able to run; 

(H2) Job computational power needs and memory re- 
quirements are known; 

(H3) Each job requires only one VM instance; 
(H4) The workload is static, meaning jobs have constant
resource requirements; furthermore, no job enters 
or leaves the system; 

(H5) VM technology allows for precise, low-overhead, 
and quickly adaptable sharing of the computational 
capabilities of a host across CPU-bound VM in- 
stances. 

These assumptions are very stringent, but provide a 
good framework to formalize our resource allocation 
problem (and to prove that it is difficult even with these 
assumptions). We relax assumption H3 in Section 5, that 
is, we consider parallel jobs. Assumption H4 amounts 
to assuming that jobs have no time horizons, i.e., that 
they run forever with unchanging requirements. In prac- 
tice, the resource allocation may need to be modified 
when the workload changes (e.g., when a new job ar- 
rives, when a job terminates, when a job starts needing 
more/fewer resources). In Section 6 we relax assumption 
H4 and extend our approach to allow allocation adapta- 
tion. We validate assumption H5 in Section 7. We leave 
relaxing H1 and H2 for future work, and discuss the
involved challenges in Section 10.

2.3. Problem Statement 

We call the resource allocation problem described in
the previous section VCSCHED and define it here formally.
Consider $H > 0$ identical physical hosts and $J > 0$ jobs.
For job $i$, $i = 1, \ldots, J$, let $\alpha_i$ be the (average)
fraction of a host's computational capability utilized
by the job if alone on a physical host, $0 \le \alpha_i \le 1$.
(Alternatively, this fraction could be specified a-priori by
the user who submitted/launched job $i$.) Let $m_i$ be the
maximum fraction of a host's memory needed by job $i$,
$0 \le m_i \le 1$. Let $\alpha_{ij}$ be the fraction of the
computational capability of host $j$, $j = 1, \ldots, H$, allocated
to job $i$, $i = 1, \ldots, J$. We have $0 \le \alpha_{ij} \le 1$. If
$\alpha_{ij}$ is constrained to be an integer, that is either 0 or 1,
then the model is that of scheduling with exclusive access to
resources. If, instead, $\alpha_{ij}$ is allowed to take rational
values between 0 and 1, then resource allocations can be
fractional and thus more fine-grain.

Constraints - We can write a few constraints due to resource
limitations. We have

$$\forall j \quad \sum_{i=1}^{J} \alpha_{ij} \le 1 ,$$

which expresses the fact that the total CPU fraction allocated
to jobs on any single host may not exceed 100%.
Also, a job should not be allocated more resource than it
can use:

$$\forall i \quad \sum_{j=1}^{H} \alpha_{ij} \le \alpha_i .$$

Similarly,

$$\forall j \quad \sum_{i=1}^{J} \lceil \alpha_{ij} \rceil \, m_i \le 1 , \qquad (1)$$

since at most the entire memory on a host may be used.

With assumption H3, a job requires only one VM instance.
Furthermore, as justified hereafter, we assume
that we do not use migration and that a job can be allocated
only to a single host. Therefore, we write the following
constraints:

$$\forall i \quad \sum_{j=1}^{H} \lceil \alpha_{ij} \rceil = 1 , \qquad (2)$$

which state that for all $i$ only one of the $\alpha_{ij}$ values is
non-zero.
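
The constraints above can be checked mechanically for a candidate allocation. The following is a minimal sketch, assuming the allocation is stored as a list of rows `alloc[i][j]` (job i, host j); the function names `check_allocation` and `min_yield` and the tolerance `tol` are illustrative choices, not part of the original formulation.

```python
def check_allocation(alloc, alpha, mem, tol=1e-9):
    """Return True if alloc[i][j] (alpha_ij) satisfies the constraints above.

    alpha[i] is the CPU need of job i and mem[i] its memory need,
    both expressed as fractions of one host.
    """
    n_jobs, n_hosts = len(alloc), len(alloc[0])
    for j in range(n_hosts):
        cpu = sum(alloc[i][j] for i in range(n_jobs))
        # ceil(alpha_ij) is 1 exactly when job i receives CPU on host j
        memory = sum(mem[i] for i in range(n_jobs) if alloc[i][j] > 0)
        if cpu > 1 + tol or memory > 1 + tol:      # CPU bound and Eq. (1)
            return False
    for i in range(n_jobs):
        if sum(alloc[i]) > alpha[i] + tol:         # no more than the job can use
            return False
        if sum(1 for a in alloc[i] if a > 0) != 1:  # Eq. (2): exactly one host
            return False
    return True


def min_yield(alloc, alpha):
    """Smallest ratio of allocated CPU to needed CPU over all jobs."""
    return min(sum(row) / a for row, a in zip(alloc, alpha))
```

For instance, with three jobs of need 0.6 on two hosts, placing two jobs on the first host at 0.5 each and the third alone at 0.6 passes the check and gives a minimum yield of 0.5/0.6.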

Objective function - We wish to optimize a performance 
metric that encompasses both notions of performance 
and of fairness, in an attempt at designing the sched- 
uler from the start with a user-centric metric in mind 
(unlike, for instance, current batch schedulers). In the tra- 
ditional parallel job scheduling literature, the metric com- 
monly acknowledged as being a good measure for both 
performance and fairness is the stretch (also called "slow- 
down") [10, 16]. The stretch of a job is defined as the 
job's turn-around time divided by the turn-around time 
that would have been achieved had the job been alone in 
the system. 

This metric cannot be applied directly in our con- 
text because jobs have no time horizons. So, instead, we 
use a new metric, which we call the yield and which we 
define for job $i$ as $\sum_j \alpha_{ij} / \alpha_i$. The yield of a job
represents the fraction of its maximum achievable compute
rate that is achieved (recall that for each $i$ only one of
the $\alpha_{ij}$ is non-zero). A yield of 1 means that the job
consumes compute resources at its peak rate. We can now
define problem VCSCHED as maximizing the minimum
yield in an attempt at optimizing both performance and 
fairness (similar in spirit to minimizing the maximum 
stretch [10, 27]). Note that we could easily maximize the 
average yield instead, but we may then decrease the fair- 
ness of the resource allocation across jobs as average 
metrics are starvation-prone [ ]. Our approach is agnos- 
tic to the particular objective function (although some 
of our results hold only for linear objective functions). 
For instance, other ways in which the stretch can be opti- 
mized have been proposed [7] and could be adapted for 
our yield metric. 

Migration - The formulation of our problem precludes 
the use of migration. However, as when optimizing job 
stretch, migration could be used to achieve better results.
Indeed, assuming that migration can be done with
no overhead or cost whatsoever, migrating tasks among
hosts in a periodic steady-state schedule affords more
flexibility for resource sharing, which could in turn be used
to maximize the minimum yield further. For instance,
consider 2 hosts and 3 tasks, with $\alpha_1 = \alpha_2 = \alpha_3 = 0.6$.
Without migration the optimal minimum yield is
$0.5/0.6 \approx .83$ (which corresponds to an allocation in
which two tasks are on the same host and each receives
50% of that host's computing power). With migration
it is possible to do better. Consider a periodic schedule
that switches between two allocations, so that on average
the schedule uses each allocation 50% of the time.
In the first allocation tasks 1 and 2 share the first host,
receiving 45% and 55% of the host's computing
power, respectively, and task 3 is on the second host by
itself, thus receiving 60% of its compute power. In the
second allocation, the situation is reversed, with task 1
by itself on the first host and tasks 2 and 3 on the second
host, task 2 receiving 55% and task 3 receiving 45%.
With this periodic schedule, the average yield of tasks 1
and 3 is $.5 \times (.45/.60 + .60/.60) \approx .87$, and the
average yield of task 2 is $.55/.60 \approx .91$. Therefore the
minimum yield is .87, which is higher than that in the
no-migration case.
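
The arithmetic of this example can be replayed in a few lines. This is only a sketch of the computation described above; the variable names are illustrative. The exact values are 0.875 for tasks 1 and 3 and about 0.9167 for task 2, which the text reports truncated to two decimals.

```python
alpha = 0.6  # CPU need of each of the three tasks

# Without migration: two tasks share one host at 50% each.
static_min_yield = 0.5 / alpha

# Periodic schedule: each allocation is in use half of the time.
alloc_a = {1: 0.45, 2: 0.55, 3: 0.60}   # tasks 1 and 2 share host 1
alloc_b = {1: 0.60, 2: 0.55, 3: 0.45}   # tasks 2 and 3 share host 2
avg_yield = {t: 0.5 * (alloc_a[t] + alloc_b[t]) / alpha for t in (1, 2, 3)}
migration_min_yield = min(avg_yield.values())
```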

Unfortunately, the assumption that migration comes 
at no cost or overhead is not realistic. While recent ad- 
vances in VM migration technology [ ] make it possible 
for a VM instance to change host with a nearly imper- 
ceptible delay, migration consumes network resources. 
It is not clear whether the pay-off of these extra migra- 
tions would justify the added cost. It could be interesting 
to allow a bounded number of migrations for the pur- 
pose of further increasing minimum yield, but for now 
we leave this question for future work. We use migra- 
tion only for the purpose of adapting to dynamic work- 
loads (see Section 6). 

2.4. Complexity Analysis 

Let us consider the decision problem associated with
VCSCHED: Is it possible to find an allocation so that its
minimum yield is above a given bound, $K$? We term this
problem VCSched-Dec. Not surprisingly, VCSched-Dec
is NP-complete. For instance, considering only job
memory constraints and two hosts, the problem triv- 
ially reduces to 2-Partition, which is known to be 
NP-complete in the weak sense [ ]. We can actually 
prove a stronger result: 

Theorem 1. VCSched-Dec is NP-complete in the 
strong sense even if host memory capacities are infinite. 

Proof. VCSched-Dec belongs to NP because a solution
can easily be checked in polynomial time. To
prove NP-completeness, we use a straightforward reduction
from 3-PARTITION, which is known to be NP-complete
in the strong sense [ ]. Let us consider $\mathcal{I}_1$,
an arbitrary instance of 3-PARTITION: given $3n$ positive
integers $\{a_1, \ldots, a_{3n}\}$ and a bound $R$, assuming
that $R/4 < a_i < R/2$ for all $i$ and that
$\sum_{j=1}^{3n} a_j = nR$, is there a partition of these numbers
into $n$ disjoint subsets $I_1, \ldots, I_n$ such that
$\sum_{j \in I_i} a_j = R$ for all $i$? (Note that $|I_i| = 3$
for all $i$.) We now build $\mathcal{I}_2$, an instance
of VCSCHED as follows. We consider $H = n$ hosts
and $J = 3n$ jobs. For job $j$ we set $\alpha_j = a_j / R$ and
$m_j = 0$. Setting $m_j$ to 0 amounts to assuming that there
is no memory contention whatsoever, or that host memories
are infinite. Finally, we set $K$, the bound on the yield,
to be 1. We now prove that $\mathcal{I}_1$ has a solution if and only
if $\mathcal{I}_2$ has a solution.

Let us assume that $\mathcal{I}_1$ has a solution. For each job $j$,
we assign it to host $i$ if $j \in I_i$, and we give it all the
compute power it needs ($\alpha_{ji} = a_j / R$). This is possible
because $\sum_{j \in I_i} a_j = R$, which implies that
$\sum_{j \in I_i} \alpha_{ji} = R/R \le 1$. In other terms, the
computational capacity of each host is not exceeded. As a result,
each job has a yield of $K = 1$ and we have built a solution
to $\mathcal{I}_2$.

Let us now assume that $\mathcal{I}_2$ has a solution. Then, for
each job $j$ there exists a unique $i_j$ such that $\alpha_{j i_j} = \alpha_j$,
and such that $\alpha_{ji} = 0$ for $i \ne i_j$ (i.e., job $j$ is allocated
to host $i_j$). Let us define $I_i = \{ j \mid i_j = i \}$. By this
definition, the $I_i$ sets are disjoint and form a partition of
$\{1, \ldots, 3n\}$.

To ensure that each processor's compute capability
is not exceeded, we must have $\sum_{j \in I_i} \alpha_j \le 1$ for all $i$.
However, by construction of $\mathcal{I}_2$, $\sum_{j=1}^{3n} \alpha_j = n$.
Therefore, since the $I_i$ sets form a partition of $\{1, \ldots, 3n\}$,
$\sum_{j \in I_i} \alpha_j$ is exactly equal to 1 for all $i$. Indeed, if
$\sum_{j \in I_{i_1}} \alpha_j$ were strictly lower than 1 for some $i_1$, then
$\sum_{j \in I_{i_2}} \alpha_j$ would have to be greater than 1 for some
$i_2$, meaning that the computational capability of a host
would be exceeded. Since $\alpha_j = a_j / R$, we obtain
$\sum_{j \in I_i} a_j = R$ for all $i$. Sets $I_i$ are thus a solution to
$\mathcal{I}_1$, which concludes the proof.

□ 
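
The construction used in the proof is simple enough to write down directly. The sketch below builds the VCSched-Dec instance from a 3-PARTITION instance and computes the per-host CPU load induced by a given partition (the forward direction of the proof). The function names and the dictionary layout are hypothetical; exact rational arithmetic is used so that loads of exactly 1 can be tested for equality.

```python
from fractions import Fraction


def build_vcsched_instance(a, R):
    """Map a 3-PARTITION instance (a, R) to the instance used in the proof."""
    n, rem = divmod(len(a), 3)
    if rem or sum(a) != n * R:
        raise ValueError("not a well-formed 3-PARTITION instance")
    return {
        "hosts": n,                              # H = n
        "alpha": [Fraction(x, R) for x in a],    # alpha_j = a_j / R
        "mem": [Fraction(0)] * len(a),           # m_j = 0 (no memory contention)
        "K": 1,                                  # bound on the yield
    }


def host_loads(instance, subsets):
    """CPU load of host i when the jobs of subset I_i run there at yield 1."""
    return [sum(instance["alpha"][j] for j in s) for s in subsets]
```

A partition solves the 3-PARTITION instance exactly when every entry returned by `host_loads` equals 1.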



2.5. Mixed-Integer Linear Program Formulation

It turns out that VCSCHED can be formulated as a 
mixed-integer linear program (MILP), that is an optimiza- 
tion problem with linear constraints and objective func- 
tion, but with both rational and integer variables. Among 
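the constraints given in Section 2.3, the constraints in
Eq. 1 and Eq. 2 are non-linear. These constraints can easily
be made linear by introducing binary integer variables,
$e_{ij}$, set to 1 if job $i$ is allocated to resource $j$, and
to 0 otherwise. We can then rewrite the constraints in
Section 2.3 as follows, with $i = 1, \ldots, J$ and $j = 1, \ldots, H$:

$$\forall i,j \quad e_{ij} \in \mathbb{N}, \qquad (3)$$

$$\forall i,j \quad \alpha_{ij} \in \mathbb{Q}, \qquad (4)$$

$$\forall i,j \quad 0 \le e_{ij} \le 1, \qquad (5)$$

$$\forall i,j \quad 0 \le \alpha_{ij} \le e_{ij}, \qquad (6)$$

$$\forall i \quad \sum_{j=1}^{H} e_{ij} = 1, \qquad (7)$$

$$\forall j \quad \sum_{i=1}^{J} \alpha_{ij} \le 1, \qquad (8)$$

$$\forall j \quad \sum_{i=1}^{J} e_{ij} \, m_i \le 1, \qquad (9)$$

$$\forall i \quad \sum_{j=1}^{H} \alpha_{ij} \le \alpha_i, \qquad (10)$$

$$\forall i \quad \sum_{j=1}^{H} \frac{\alpha_{ij}}{\alpha_i} \ge Y. \qquad (11)$$

Recall that $m_i$ and $\alpha_i$ are constants that define the jobs.
The objective is to maximize $Y$, i.e., to maximize the
minimum yield.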
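
For very small instances the optimum of this program can also be found without a MILP solver, by enumerating every job-to-host assignment. The sketch below is such a brute-force stand-in, not the solver-based approach used in this work. It relies on the observation that, once the placement is fixed, all jobs on a host with total CPU need L can be given a common yield of at most min(1, 1/L). The function name `solve_exact` is illustrative, and every job is assumed to have a strictly positive CPU need.

```python
from itertools import product


def solve_exact(alpha, mem, n_hosts, tol=1e-9):
    """Return (best minimum yield, assignment) by exhaustive enumeration.

    The running time is n_hosts ** len(alpha), so this is only usable
    for a handful of jobs.
    """
    best_yield, best_assign = 0.0, None
    for assign in product(range(n_hosts), repeat=len(alpha)):
        cpu = [0.0] * n_hosts
        ram = [0.0] * n_hosts
        for i, j in enumerate(assign):
            cpu[j] += alpha[i]
            ram[j] += mem[i]
        if any(r > 1 + tol for r in ram):
            continue                      # violates the memory constraint
        # Best common yield for this placement: limited by the busiest host.
        y = min([1.0] + [1.0 / c for c in cpu if c > 0])
        if y > best_yield:
            best_yield, best_assign = y, assign
    return best_yield, best_assign
```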

3. Algorithms for Solving VCSCHED

In this section we propose algorithms to solve VC- 
SCHED, including exact and relaxed solutions of the 
MILP in Section 2.5 as well as ad-hoc heuristics. We 
also give a generally applicable technique to improve av- 
erage yield further once the minimum yield has been 
maximized. 

3.1. Exact and Relaxed Solutions 

In general, solving a MILP requires exponential time 
and is only feasible for small problem instances. We use 
a publicly available MILP solver, the GNU Linear Programming
Kit (GLPK), to compute the exact solution
when the problem instance is small (i.e., few tasks and/or
few hosts). We can also solve a relaxation of the MILP
by assuming that all variables are rational, converting the
problem to an LP. In practice a rational linear program can
be solved in polynomial time. The resulting solution
may be infeasible (namely because it could spread
a single job over multiple hosts due to non-binary $e_{ij}$
values), but it has two important uses. First, the value of the
objective function is an upper bound on what is achievable
in practice, which is useful to evaluate the absolute
performance of heuristics. Second, the rational solution
may point the way toward a good feasible solution
that is computed by rounding off the $e_{ij}$ values to
integer values judiciously, as discussed in the next section.

It turns out that we do not need a linear program
solver to compute the optimal minimum yield for the relaxed
program. Indeed, if the total of job memory requirements
is not larger than the total available memory
(i.e., if $\sum_{i=1}^{J} m_i \le H$) then there is a solution to the
relaxed version of the problem and the achieved optimal
minimum yield, $Y^{(rat)}_{opt}$, can be computed easily:

$$Y^{(rat)}_{opt} = \min \left\{ \frac{H}{\sum_{i=1}^{J} \alpha_i} , \; 1 \right\} .$$

The above expression is an obvious upper bound on the
maximum minimum yield. To show that it is in fact the
optimal, we simply need to exhibit an allocation that
achieves this objective. A simple such allocation is:

$$\forall i,j \quad e_{ij} = \frac{1}{H} \quad \text{and} \quad \alpha_{ij} = \frac{1}{H} \, \alpha_i \, Y^{(rat)}_{opt} .$$
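
The closed form above translates directly into code. The following sketch computes the relaxed optimum and the even-split allocation that achieves it; the function name `relaxed_optimum` and the return layout are illustrative assumptions.

```python
def relaxed_optimum(alpha, mem, n_hosts):
    """Optimal minimum yield of the relaxed problem and an allocation reaching it.

    Returns None when the total memory need exceeds the total memory.
    """
    if sum(mem) > n_hosts:
        return None
    y = min(1.0, n_hosts / sum(alpha))
    # Every job is spread evenly over all hosts.
    e = [[1.0 / n_hosts] * n_hosts for _ in alpha]
    alloc = [[a * y / n_hosts] * n_hosts for a in alpha]
    return y, e, alloc
```

With three jobs of need 0.9 on two hosts, the yield is 2/2.7 and each host is exactly saturated, which is why this value cannot be exceeded.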

3.2. Algorithms Based on Relaxed Solutions 

We propose two heuristics, RRND and RRNZ, that
use a solution of the rational LP as a basis and then
round off the rational $e_{ij}$ values to attempt to produce a
feasible solution, which is a classical technique. In the
previous section we have shown a solution for the LP.
Unfortunately, that solution has the undesirable property that
it splits each job evenly across all hosts, meaning that
all $e_{ij}$ values are identical. It is therefore a poor starting
point for heuristics that attempt to round off $e_{ij}$ values
based on their magnitude. Instead, we use GLPK
to solve the relaxed MILP and use the produced solution
as a starting point.

Randomized Rounding (RRND) - This heuristic first
solves the LP. Then, for each job $i$ (taken in an arbitrary
order), it allocates it to a random host using a probability
distribution by which host $j$ has probability $e_{ij}$ of being
selected. If the job cannot fit on the selected host
because of memory constraints, then that host's probability
of being selected is set to zero and another attempt
is made with the relative probabilities of the remaining
hosts adjusted accordingly. If no selection has
been made and every host has zero probability of being
selected, then the heuristic fails. Such a probabilistic
approach for rounding rational variable values into integer
values has been used successfully in previous work [31].

Randomized Rounding with No Zero probability (RRNZ) -
This heuristic is a slight modification of the RRND heuristic.
One problem with RRND is that a job, $i$, may not fit
(in terms of memory requirements) on any of the hosts, $j$,
for which $e_{ij} > 0$, in which case the heuristic would fail
to generate a solution. To remedy this problem, we set each
$e_{ij}$ value that is equal to zero in the solution of the relaxed
MILP to $\epsilon$ instead, where $\epsilon \ll 1$ (we used
$\epsilon = 0.01$). For those problem instances for which RRND
provides a solution, RRNZ should provide nearly the same
solution most of the time. But RRNZ should also provide a
solution to some instances for which RRND fails, thus
achieving a better success rate.
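
The rounding step shared by both heuristics can be sketched as follows, assuming the fractional solution of the relaxed program is already available as a matrix `e[i][j]` (obtaining it from an LP solver is outside this sketch). The function name `randomized_rounding` and the `eps` parameter are illustrative: `eps=0` corresponds to RRND and a small positive `eps` to RRNZ. Only the job placement is produced here.

```python
import random


def randomized_rounding(e, mem, eps=0.0, rng=None):
    """Pick one host per job at random, weighted by the fractional e[i][j].

    Returns the list of chosen hosts, or None if some job fits nowhere
    among the hosts that still have non-zero probability.
    """
    rng = rng or random.Random()
    n_hosts = len(e[0])
    free = [1.0] * n_hosts                 # remaining memory per host
    placement = []
    for i, row in enumerate(e):
        # RRNZ: replace zero entries by eps so that every host stays eligible.
        weights = [w if w > 0 else eps for w in row]
        while True:
            if sum(weights) <= 0:
                return None                # every host has zero probability
            j = rng.choices(range(n_hosts), weights=weights)[0]
            if mem[i] <= free[j] + 1e-12:
                free[j] -= mem[i]
                placement.append(j)
                break
            weights[j] = 0.0               # does not fit: exclude and retry
    return placement
```

In the two-job example where both rows of `e` are `[1.0, 0.0]` and each job needs 60% of a host's memory, the `eps=0` variant fails while `eps=0.01` always ends with one job per host.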

3.3. Greedy Algorithms

Greedy (GR) - This heuristic first goes through the list 
of jobs in arbitrary order. For each job the heuristic ranks 
the hosts according to their total computational load, 
that is, the total of the maximum computation require- 
ments of all jobs already assigned to a host. The heuristic 
then selects the first host, in non-decreasing order of com- 
putational load, for which an assignment of the current 
job to that host will satisfy the job's memory require- 
ments. 

Sorted-Task Greedy (SG) - This version of the greedy 
heuristic first sorts the jobs in descending order by their 
memory requirements before proceeding as in the standard
greedy algorithm. The idea is to place relatively
large jobs while the system is still lightly loaded.

Greedy with Backtracking (GB) - It is possible to modify
the GR heuristic to add backtracking. Clearly full-fledged
backtracking would lead to 100% success rate
for all instances that have a feasible solution, but it would 
also require potentially exponential time. One thus needs 
methods to prune the search tree. We use a simple method, 
placing an arbitrary bound (500,000) on the number of 
job placement attempts. An alternate pruning technique 
would be to restrict placement attempts to the top 25% 
candidate placements, but based on our experiments it is 
vastly inferior to using an arbitrary bound on job place- 
ment attempts. 

Sorted Greedy with Backtracking (SGB) - This version
is a combination of SG and GB, i.e., tasks are sorted
in descending order of memory requirement as in SG and 
backtracking is used as in GB. 
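
The two non-backtracking variants, GR and SG, can be sketched in a single function. The name `greedy`, the `sort_by_memory` flag, and the use of index order as the "arbitrary" job order are assumptions made for illustration; backtracking (GB, SGB) is not covered here.

```python
def greedy(alpha, mem, n_hosts, sort_by_memory=False):
    """Place each job on the least CPU-loaded host that has enough memory.

    With sort_by_memory=True the jobs are first sorted by descending
    memory need (the SG variant). Returns one host index per job, or None.
    """
    order = list(range(len(alpha)))
    if sort_by_memory:
        order.sort(key=lambda i: mem[i], reverse=True)
    load = [0.0] * n_hosts                 # sum of CPU needs placed on a host
    free = [1.0] * n_hosts                 # remaining memory on a host
    placement = [None] * len(alpha)
    for i in order:
        # Hosts in non-decreasing order of computational load.
        for j in sorted(range(n_hosts), key=lambda h: load[h]):
            if mem[i] <= free[j] + 1e-12:
                load[j] += alpha[i]
                free[j] -= mem[i]
                placement[i] = j
                break
        else:
            return None                    # the job fits on no host
    return placement
```

With memory needs 0.3, 0.3, 0.3, 0.8 on two hosts, the unsorted variant spreads the small jobs first and then cannot place the large one, whereas sorting by memory succeeds.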

3.4. Multi-Capacity Bin Packing Algorithms 

Resource allocation problems are often akin to bin 
packing problems, and VCSCHED is no exception. There 
are however two important differences between our problem
and bin packing. First, our tasks' resource requirements
are dual, with both memory and CPU requirements.
Second, our CPU requirements are not fixed but
depend on the achieved yield. The first difference can be
addressed by using "multi-capacity" bin packing heuristics.
Two multi-capacity bin packing heuristics were proposed
in [ ] for the general case of d-capacity bins and
items, but in the d = 2 case these two algorithms turn 
out to be equivalent. The second difference can be ad- 
dressed via a binary search on the yield value. 

Consider an instance of VCSCHED and a fixed value 
of the yield, Y, that needs to be achieved. By fixing Y, 
each task has both a fixed memory requirement and a 
fixed CPU requirement, both taking values between 0
and 1, making it possible to apply the algorithm in [28]
directly. 



Accordingly, one splits the tasks into two lists, with 
one list containing the tasks with higher CPU require- 
ments than memory requirements and the other contain- 
ing the tasks with higher memory requirements than CPU 
requirements. One then sorts each list. In [ ] the lists
are sorted according to the sum of the CPU and mem- 
ory requirements. 

Once the lists are sorted, one can start assigning tasks 
to the first host. Lists are always scanned in order, search- 
ing for a task that can "fit" on the host, which for the sake 
of this discussion we term a "possible task". Initially one 
searches for a possible task in one and then the other 
list, starting arbitrarily with any list. This task is then as- 
signed to the host. Subsequently, one always searches 
for a possible task from the list that goes against the cur- 
rent imbalance. For instance, say that the host's available 
memory capacity is 50% and its available CPU capac- 
ity is 80%, based on tasks that have been assigned to 
it so far. In this case one would scan the list of tasks 
that have higher CPU requirements than memory require- 
ments to find a possible task. If no such possible task 
is found, then one scans the other list to find a possi- 
ble task. When no possible tasks are found in either list, 
one starts this process again for the second host, and so 
on for all hosts. If all tasks can be assigned in this man- 
ner on the available hosts, then resource allocation is suc- 
cessful. Otherwise resource allocation fails. 

The final yield must be between 0, representing fail- 
ure, and the smaller of 1 or the total computation capacity 
of all the hosts divided by the total computational require- 
ments of all the tasks. We arbitrarily choose to start at 
one-half of this value and perform a binary search of pos- 
sible minimum yield values, seeking to maximize mini- 
mum yield. Note that under some circumstances the al- 
gorithm may fail to find a valid allocation at a given po- 
tential yield value, even though it would find one given a 
larger yield value. This type of failure condition is to be 
expected when applying heuristics. 
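
The binary search over candidate yields can be sketched as below. The function names and the `try_pack` callback are hypothetical; `try_pack(y)` is assumed to return True when the packing heuristic succeeds with all CPU needs scaled by `y`. Because the heuristic is not guaranteed to be monotone in `y`, the value returned is the largest yield at which a probe succeeded, not necessarily the largest feasible yield.

```python
def yield_upper_bound(cpu_needs, num_hosts):
    """Smaller of 1 and total capacity / total CPU need (unit-capacity hosts)."""
    return min(1.0, num_hosts / sum(cpu_needs))


def search_min_yield(try_pack, upper, tol=1e-4):
    """Binary search on [0, upper]; the first probe is upper / 2.

    Returns the largest probed yield for which try_pack succeeded,
    or None if every probe failed.
    """
    lo, hi, best = 0.0, upper, None
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if try_pack(mid):
            best, lo = mid, mid  # success: look for a higher yield
        else:
            hi = mid             # failure: look for a lower yield
    return best
```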

While the algorithm in [_ .] sorts each list by the sum 
of the memory and CPU requirements, there are other 
likely sorting key candidates. For completeness we exper- 
iment with 8 different options for sorting the lists, each 
resulting in a MCB (Multi-Capacity Bin Packing) algo- 
rithm. We describe all 8 options below: 

• MCB1: memory + CPU, in ascending order;

• MCB2: max(memory,CPU) - min(memory,CPU), 
in ascending order; 

• MCB3: max(memory,CPU) / min(memory,CPU), 
in ascending order; 

• MCB4: max(memory,CPU), in ascending order; 

• MCB5: memory + CPU, in descending order;

• MCB6: max(memory,CPU) - min(memory,CPU),
in descending order;

• MCB7: max(memory,CPU) / min(memory,CPU), 
in descending order; 

• MCB8: max(memory,CPU), in descending order.
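
The eight sorting options listed above can be written down compactly. This is an illustrative sketch: the `SORT_KEYS` table and `mcb_order` function are hypothetical names, tasks are represented as `(memory, CPU)` tuples, and the ratio key is assumed to be infinite when the smaller requirement is zero.

```python
def _ratio(m, c):
    # Assumption: treat the ratio as infinite when the smaller requirement is 0.
    return max(m, c) / min(m, c) if min(m, c) > 0 else float("inf")


# variant name -> (key function of (memory, cpu), descending?)
SORT_KEYS = {
    "MCB1": (lambda m, c: m + c, False),
    "MCB2": (lambda m, c: max(m, c) - min(m, c), False),
    "MCB3": (_ratio, False),
    "MCB4": (lambda m, c: max(m, c), False),
    "MCB5": (lambda m, c: m + c, True),
    "MCB6": (lambda m, c: max(m, c) - min(m, c), True),
    "MCB7": (_ratio, True),
    "MCB8": (lambda m, c: max(m, c), True),
}


def mcb_order(tasks, variant):
    """Sort (memory, cpu) tuples according to the named MCB variant."""
    key, descending = SORT_KEYS[variant]
    return sorted(tasks, key=lambda t: key(t[0], t[1]), reverse=descending)
```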

3.5. Increasing Average Yield 

While the objective function to be maximized for solv- 
ing VCSCHED is the minimum yield, once an allocation 
that achieves this goal has been found there may be ex- 
cess computational resources available which would be 
wasted if not allocated. Let us call y the maximized mini- 
mum yield value computed by one of the aforementioned 
algorithms (either an exact value obtained by solving 
the MILP, or a likely sub-optimal value obtained with a 
heuristic). One can then solve a new linear program simply
by adding the constraint $Y \geq y$ and seeking to maximize
$\sum_j \alpha_{ij}/\alpha_i$, the average yield. Unfortunately
this new program also contains both integer and rational
variables, therefore requiring exponential time for
computing an exact solution. Therefore, we choose to impose
the additional condition that the $e_{ij}$ values be
unchanged in this second round of optimization. In other
terms, only CPU fractions can be modified to improve
average yield, not job locations. This amounts to replacing
the $e_{ij}$ variables by their values as constants when
maximizing the average yield, and the new linear program
then has only rational variables.

It turns out that, rather than solving this linear program
with a linear program solver, we can use the following
optimal greedy algorithm. First, for each job i assigned
to host j, we set the fraction of the compute capability
of host j given to job i to the value exactly achieving
the maximum minimum yield: $\alpha_{ij} = \alpha_i \cdot y$.
Then, for each host, we scale up the compute fraction of the
job with the smallest compute requirement $\alpha_i$ until
either the host has no compute capability left or the job's
compute requirement is fully fulfilled. In the latter case,
we then apply the same scheme to the job with the second
smallest compute requirement on that host, and so on.
The optimality of this process is easily proved via a
typical exchange argument.
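
The greedy procedure for a single host can be sketched as follows. The function name `maximize_average_yield` is hypothetical, and the sketch assumes a unit-capacity host and that the baseline allocation at yield `y` already fits on it.

```python
def maximize_average_yield(needs, y):
    """Allocate CPU on one unit-capacity host.

    needs: CPU needs of the jobs placed on this host.
    y:     the maximized minimum yield found in the first phase.
    Returns the per-job CPU allocations.
    """
    alloc = [a * y for a in needs]          # baseline: every job gets yield y
    spare = 1.0 - sum(alloc)                # unallocated CPU on the host
    # Serve jobs by increasing CPU need until the host is exhausted.
    for i in sorted(range(len(needs)), key=lambda i: needs[i]):
        if spare <= 0:
            break
        extra = min(needs[i] - alloc[i], spare)
        alloc[i] += extra
        spare -= extra
    return alloc
```

For example, with needs 0.8, 0.4, and 0.6 and a minimum yield of 0.5, the baseline uses 0.9 of the host, and the remaining 0.1 goes entirely to the job with need 0.4.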

All our heuristics use this average yield maximization 
technique after maximizing the minimum yield. 

4. Simulation Experiments 

We evaluate our heuristics based on four metrics:
(i) the achieved minimum yield; (ii) the achieved average
yield; (iii) the failure rate; and (iv) the run time.
We also compare the heuristics with the exact solution of 



the MILP for small instances, and to the (unreachable up- 
per bound) solution of the rational LP for all instances. 
The achieved minimum and average yields considered 
are average values over successfully solved problem in- 
stances. The run times given include only the time required
for the given heuristic since all algorithms use the
same average yield maximization technique.

4.1. Experimental Methodology 

We conducted simulations on synthetic problem in- 
stances. We defined these instances based on the num- 
ber of hosts, the number of jobs, the total amount of free 
memory, or memory slack, in the system, the average job 
CPU requirement, and the coefficient of variance of both 
the memory and CPU requirements of jobs. The mem- 
ory slack is used rather than the average job memory re- 
quirement since it gives a better sense of how tightly 
packed the system is as a whole. In general (but not al- 
ways) the greater the slack the greater the number of fea- 
sible solutions to VCSCHED. 

Per-task CPU and memory requirements are sampled
from a normal distribution with given mean and coefficient
of variance, truncated so that values are between
0 and 1. The mean memory requirement is defined as
$H \cdot (1 - \mathit{slack})/J$, where slack has a value
between 0 and 1. The mean CPU requirement is taken to be 0.5,
which in practice means that feasible instances with fewer
than twice as many tasks as hosts have a maximum minimum
yield of 1.0 with high probability. We do not ensure
that every problem instance has a feasible solution.
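
A possible instance generator along these lines is sketched below. The names `sample_truncated` and `make_instance` are hypothetical, and the truncation is implemented by rejection sampling on the interval (0, 1], which is an assumption: the text does not say how truncation is performed, and rejection slightly shifts the realized mean.

```python
import random


def sample_truncated(rng, mean, cv, low=0.0, high=1.0):
    """Draw from N(mean, (cv * mean)^2), rejecting values outside (low, high]."""
    while True:
        x = rng.gauss(mean, cv * mean)
        if low < x <= high:
            return x


def make_instance(num_hosts, num_jobs, slack, cv, seed=0):
    """Return a list of (cpu_need, mem_need) pairs for one problem instance."""
    rng = random.Random(seed)
    mean_cpu = 0.5
    mean_mem = num_hosts * (1.0 - slack) / num_jobs
    return [
        (sample_truncated(rng, mean_cpu, cv), sample_truncated(rng, mean_mem, cv))
        for _ in range(num_jobs)
    ]
```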

Two different sets of problem instances are examined. 
The first set of instances, "small" problems, includes in- 
stances with small numbers of hosts and tasks. Exact op- 
timal solutions to these problems can be found with a 
MILP solver in a tractable amount of time (from a few 
minutes to a few hours on a 3.2GHz machine using the
GLPK solver). The second set of instances, "large" prob- 
lems, includes instances for which the numbers of hosts 
and tasks are too large to compute exact solutions. For the 
small problem set we consider 4 hosts with 6, 8, 10, or 12 
tasks. Slack ranges from 0.1 to 0.9 with increments of 0.1,
while coefficients of variance for memory and CPU re- 
quirements are given values of 0.25 and 0.75, for a to- 
tal of 144 different problem specifications. 10 instances 
are generated for each problem specification, for a to- 
tal of 1,440 instances. For the large problem set we con- 
sider 64 hosts with sets of 100, 250 and 500 tasks. Slack 
and coefficients of variance for memory and CPU re- 
quirements are the same as for the small problem set for 
a total of 108 different problem specifications. 100 instances
of each problem specification were generated for
a total of 10,800 instances. 
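
The specification counts quoted above follow from a full cross product of the parameters, as the short check below illustrates. The helper name `count_specs` is hypothetical; the parameter values are those given in the text.

```python
from itertools import product


def count_specs(task_counts, slacks, cvs):
    """Number of (tasks, slack, memory CV, CPU CV) combinations."""
    return sum(1 for _ in product(task_counts, slacks, cvs, cvs))


slacks = [round(0.1 * k, 1) for k in range(1, 10)]  # 0.1, 0.2, ..., 0.9
cvs = [0.25, 0.75]

small = count_specs([6, 8, 10, 12], slacks, cvs)    # 4 * 9 * 2 * 2
large = count_specs([100, 250, 500], slacks, cvs)   # 3 * 9 * 2 * 2
```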

Figure 2. MCB Algorithms - Minimum
Yield vs. Slack for small problem in-
stances.

Figure 3. MCB Algorithms - Average Yield
vs. Slack for small problem instances.



4.2. Experimental Results 

4.2.1. Multi-Capacity Bin Packing We first present 
results only for our 8 multi-capacity bin packing algo- 
rithms to determine the best one. Figure 2 shows the 
achieved maximum minimum yield versus the mem- 
ory slack averaged over small problem instances. As ex- 
pected, as the memory slack increases all algorithms tend 
to do better although some algorithms seem to experi- 
ence slight decreases in performance beyond a slack of 
0.4. Also expected, we see that the four algorithms that 
sort the tasks by descending order outperform the four 
that sort them by ascending order. Indeed, it is known 
that for bin packing starting with large items typically 



algorithm    % deg. from best
             avg.      max
MCB8          1.06     40.45
MCB5          ...      ...
MCB6          1.83     37.61
MCB7          3.91     40.71
MCB3         11.76     55.73
MCB2         14.21     48.30
MCB1         14.90     55.84
MCB4         17.32     46.95

Table 1. Average and Maximum percent 
degradation from best of the MCB algo- 
rithms for small problem instances. 



leads to better results on average. 

The main message here is that MCB8 outperforms all 
other algorithms across the board. This is better seen in 
Table 1 , which shows the average and maximum percent 
degradation from best for all algorithms. For a problem 
instance, the percent degradation from best of an algo- 
rithm is defined as the difference, in percentage, between 
the minimum yield achieved by an algorithm and the min- 
imum yield achieved by the best algorithm for this in- 
stance. The average and maximum percent degradations 
from best are computed over all instances. We see that 
MCB8 has the lowest average percent degradation from 
best. MCB5, which corresponds to the algorithm in ,] 
performs well but not as well as MCB8. In terms of maximum
percent degradation from best, we see that MCB8
ranks third, overtaken by MCB5 and MCB6. Examining
the results in more detail shows that, for these small
problem instances, the maximum degradations from best
are due to outliers. For instance, for the MCB8 algorithm,
out of the 1,379 solved instances, there are only 155
instances for which the degradation from best is larger than
3%, and only 19 for which it is larger than 10%.
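
The metric can be computed per instance as sketched below. The function name is hypothetical, and the sketch assumes the difference is expressed relative to the best minimum yield on that instance; the text does not spell out the normalization.

```python
def percent_degradation(yields):
    """yields: algorithm name -> minimum yield achieved on one instance.

    Returns algorithm name -> percent degradation from the best algorithm.
    Assumption: degradation is measured relative to the best yield.
    """
    best = max(yields.values())
    if best <= 0:
        raise ValueError("no algorithm achieved a positive yield")
    return {name: 100.0 * (best - y) / best for name, y in yields.items()}
```

Averaging or taking the maximum of these values over all solved instances would give the two columns reported in the tables.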

Figure 3 shows the average yield versus the slack (re- 
call that the average yield is optimized in a second phase, 
as described in Section 3.5). We see here again that the 
MCB8 algorithm is among the very best algorithms. 

Figure 4 shows the failure rates of the 8 algorithms 
versus the memory slack. As expected failure rates de- 
crease as the memory slack increases, and as before we 
see that the four algorithms that sort tasks by descend- 
ing order outperform the algorithms that sort tasks by as- 
cending order. Finally, Figure 5 shows the runtime of 
the algorithms versus the number of tasks. We use a 
3.2GHz Intel Xeon processor. All algorithms have aver- 
age run times under 0.18 milliseconds, with MCB8 the 
fastest by a tiny margin. 

Figures 6, 8, 9, and 10 are similar to Figures 2, 3, 

Figure 4. MCB Algorithms - Failure Rate
vs. Slack for small problem instances.

Figure 6. MCB Algorithms - Minimum
Yield vs. Slack for large problem in-
stances.

Figure 5. MCB Algorithms - Run time vs.
Number of Tasks for small problem in-
stances.



4, and 5, but show results for large problem instances. 
The message is the same here: MCB8 is the best algo- 
rithm, or closer on average to the best than the other al- 
gorithms. This is clearly seen in Table 7, which is similar 
to Table 1, and shows the average and maximum percent 
degradation from best for all algorithms for large prob- 
lem instances. According to both metrics MCB8 is the 
best algorithm, with MCB5 performing well but not as 
well as MCB8. 

In terms of run times, Figure 10 shows run times under
one-half second for 500 tasks for all of the MCB
algorithms. MCB8 is again the fastest by a tiny margin.

Based on our results we conclude that MCB8 is the



algorithm    % deg. from best
             avg.      max
MCB8          0.09      3.16
MCB5          0.25      3.50
MCB6          0.46     16.68
MCB7          1.04     48.39
MCB3          4.07     64.71
MCB2          8.68     46.68
MCB1         10.97     73.33
MCB4         14.80     61.20

Figure 7. Average and Maximum percent 
degradation from best of the MCB algo- 
rithms for large problem instances. 



best option among the 8 multi-capacity bin packing op- 
tions. In all that follows, to avoid graph clutter, we ex- 
clude the 7 other algorithms from our overall results. 

4.2.2. Small Problems Figure 11 shows the achieved
maximum minimum yield versus the memory slack in 
the system for our algorithms, the MILP solution, and for 
the solution of the rational LP, which is an upper bound 
on the achievable solution. The solution of the LP is only 
about 4% higher on average than that of the MILP, al- 
though it is significantly higher for very low slack values. 
The solution of the LP will be interesting for large prob- 
lem instances, for which we cannot compute an exact so- 
lution. On average, the exact MILP solution is about 2% 
better than MCB8, and about 11% to 13% better than the
greedy algorithms. All greedy algorithms exhibit roughly 
the same performance. The RRND and RRNZ algorithms 

Figure 9. MCB Algorithms - Failure Rate
vs. Slack for large problem instances.

Figure 11. Minimum Yield vs. Slack for
small problem instances.



lead to results markedly poorer than the other algorithms, 
with expectedly the RRNZ algorithm slightly outperform- 
ing the RRND algorithm. Interestingly, once the slack 
reaches 0.2 the results of both the RRND and RRNZ al- 
gorithms begin to worsen. 

Figure 12 is similar to Figure 11 but plots the average
yield. The solution to the rational LP, the MILP solution,
the MCB8 solution, and the solutions produced by the
greedy algorithms are all within a few percent of each 
other. As in Figure 11, when the slack is lower than 0.2 
the relaxed solution is significantly better. 

Figure 13 plots the failure rates of our algorithms. The 
RRND algorithm has the worst failure rate, followed by 
GR and then RRNZ. There were a total of 60 instances 
out of the 1,440 generated which were judged to be infea- 



sible because the GLPK solver could not find a solution 
for them. We see that the MCB8, SG, and SGB algorithms
have failure rates that are not significantly larger
than that of the exact MILP solution. Out of the 1,380 fea- 
sible instances, the GB and SGB never fail to find a so- 
lution, the MCB8 algorithm fails once, and the SG algo- 
rithm fails 15 times. 

Figure 14 shows the run times of the various algo- 
rithms on a 3.2GHz Intel Xeon processor. The computa- 
tion time of the exact MILP solution is so much greater 
than that of the other algorithms that it cannot be seen 
on the graph. Computing the exact solution to the MILP 
took an average of 28.7 seconds; however, there were 9
problem instances with solutions that took over 500 sec- 

Figure 12. Average Yield vs. Slack for
small problem instances.

Figure 14. Run time vs. Number of Tasks
for small problem instances.

Figure 13. Failure Rate vs. Slack for small
problem instances.



onds to compute, and a single problem instance that re- 
quired 11,549.29 seconds (a little over 3 hours) to solve.
For the small problem instances the average run times of 
all greedy algorithms and of the MCB8 algorithm are un- 
der 0.15 milliseconds, with the simple GR and SG al- 
gorithms being the fastest. The RRND and RRNZ al- 
gorithms are significantly slower, with run times a lit- 
tle over 2 milliseconds on average; they also cannot be 
seen on the graph. 

4.2.3. Large Problems 

Figures 15, 16, 17, and 18 are similar to Figures 11, 
12, 13, and 14 respectively, but for large problem in- 
stances. In Figure 15 we can see that the MCB8 algorithm



achieves far better results than any other heuristic. Fur- 
thermore, MCB8 is extremely close to the upper bound 
as soon as the slack is 0.3 or larger and is only 8% away 
from this upper bound when the slack is 0.2. When the 
slack is 0.1, MCB8 is 37% away from the upper bound 
but we have seen with the small problem instances that 
in this case the upper bound is significantly larger than 
the actual optimal (see Figure 11). 

The performance of the greedy algorithms has wors- 
ened relative to the rational LP solution, on average 20% 
lower for slack values larger than 0.2. The GR and GB al- 
gorithms perform nearly identically, showing that back- 
tracking does not help on the large problem instances. 
The RRNZ algorithm is again a poor performer, with a 
profile that, unexpectedly, drops as slack increases. The 
RRND algorithm not only achieved the lowest values for 
minimum yield, but also completely failed to solve any in- 
stances of the problem for slack values less than 0.4. 

Figure 16 shows the achieved average yield values. 
The MCB8 algorithm again tracks the optimal for slack 
values larger than 0.3. A surprising observation at first 
glance is that the greedy algorithms manage to achieve 
higher average yields than the optimal or MCB algo- 
rithms. This is due to their lower achieved minimum 
yields. Indeed, with a lower minimum yield, average 
yield maximization is less constrained, making it possi- 
ble to achieve higher average yield than when starting 
from an allocation optimal for the minimum yield. The
greedy algorithms thus trade off fairness for higher av- 
erage performance. The RRNZ algorithm starts out do- 
ing well for average slack, even better than GR or GB 
when the slack is low, but does much worse as slack in- 
creases. 

Figure 17 shows that for large problem instances the
GB and SGB algorithms have nearly as many failures as
the GR and SG algorithms when slack is low. This suggests
that the arbitrary bound of 500,000 placement attempts
when backtracking, which was more than sufficient
for the small problem set, has little effect on overall
performance for the large problem set. It could thus be
advisable to set the bound on the number of placement at- 
tempts based on the size of the problem set and time al- 
lowed for computation. The RRND algorithm is the only 
algorithm with a significant number of failures for slack 
values larger than 0.3. The SG, SGB and MCB8 algorithms
exhibit the lowest failure rates, about 40% lower
than that experienced by the other greedy and RRNZ al- 
gorithms, and more than 14 times lower than the failure 
rate of the RRND algorithm. Keep in mind that, based 
on our experience with the small problem set, some of 
the problem instances with small slacks may not be feasi- 
ble at all. 

Figure 18 plots the average time needed to compute
the solution to VCSCHED on a 3.2GHz Intel Xeon for all 
the algorithms versus the number of jobs. The RRND and 
RRNZ algorithms require significant time, up to roughly 
650 seconds on average for 500 tasks, and so cannot be 
seen at the given scale. This is attributed to solving the 
relaxed MILP using GLPK. Note that this time could 
be reduced significantly by using a faster solver (e.g., 
CPLEX [. ,]). The GB and SGB algorithms require sig- 
nificantly more time when the number of tasks is small. 
This is because the failure rate decreases as the number 
of tasks increases. For a given set of parameters, increas- 
ing the number of tasks decreases granularity. Since there 
is a relatively large number of unsolvable problems when 
the number of tasks is small, these algorithms spend a 
lot of time backtracking and searching through the solution
space fruitlessly, ultimately stopping only when
the bounded number of backtracking attempts is reached. 
The greedy algorithms are faster than the MCB8 algo- 
rithm, returning solutions in 15 to 20 milliseconds on 
average for 500 tasks as compared to nearly half a sec- 
ond for MCB8. Nevertheless, less than 0.5 seconds for 500
tasks is clearly acceptable in practice. 

4.2.4. Discussion Our main result is that the multi- 
capacity bin packing algorithm that sorts tasks in descend- 
ing order by their largest resource requirement (MCB8) is 
the algorithm of choice. It outperforms or equals all other 
algorithms nearly across the board in terms of minimum
yield, average yield, and failure rate, while exhibiting rel- 
atively low run times. The sorted greedy algorithms (SG 
or SGB) lead to reasonable results and could be used 
for very large numbers of tasks, for which the run time 
of MCB8 may become too high. The use of backtrack- 
ing in the algorithms GB and SGB led to performance 
improvements for small problem sets but not for large 

Figure 15. Minimum Yield vs. Slack for
large problem instances.

Figure 16. Average Yield vs. Slack for large
problem instances.



problem sets, suggesting that some sort of backtrack- 
ing system with a problem-size- or run-time-dependent 
bound on the number of branches to explore could poten- 
tially be effective. 

5. Parallel Jobs 

5.1. Problem Formulation 

In this section we explain how our approach and al- 
gorithms can be easily extended to handle parallel jobs 
that consist of multiple tasks (relaxing assumption H3). 
We have thus far only concerned ourselves with indepen- 
dent jobs that are both indivisible and small enough to 
run on a single machine. However, in many cases users 

Figure 17. Failure Rate vs. Slack for large
problem instances.

Figure 18. Run time vs. Number of Tasks
for large problem instances.



may want to split up jobs into multiple tasks, either be- 
cause they wish to use more CPU power in order to re- 
turn results more quickly or because they wish to process 
an amount of data that does not fit comfortably within 
the memory of a single machine. 

One naive way to extend our approach to parallel 
jobs would be to simply consider the tasks of a job in- 
dependently. In this case individual tasks of the same 
job could then receive different CPU allocations. How- 
ever, in the vast majority of parallel jobs it is not useful 
to have some tasks run faster than others as either the job 
makes progress at the rate of the slowest task or the job 
is deemed complete only when all tasks have completed. 
Therefore, we opt to add constraints to our linear pro- 



gram to enforce that the CPU allocations of tasks within 
the same job must be identical. It would be straightfor- 
ward to have more sophisticated constraints if specific 
knowledge about a particular job is available (e.g., task 
A should receive twice as much CPU as task B). 

Another important issue here is the possibility of gam- 
ing the system when optimizing the average yield. When 
optimizing the minimum yield, a division of a job into 
multiple tasks that leads to a higher minimum yield bene- 
fits all jobs. However, when considering the average yield 
optimization, which is done in our approach as a second 
round of optimization, a problem arises because the aver- 
age yield metric favors small tasks, that is, tasks that have 
low CPU requirements. Indeed, when given the choice to 
increase the CPU allocation of a small task or of a larger 
task, for the same additional fraction of CPU, the abso- 
lute yield increase would be larger for the small task, 
and thus would lead to a higher average yield. There- 
fore, an unscrupulous user might opt for breaking his/her 
job into unnecessarily many smaller tasks, perhaps hurt- 
ing the parallel efficiency of the job, but acquiring an over- 
all larger portion of the total available CPU resources, 
which could lead to shorter job execution time. To rem- 
edy this problem we use a per-job yield metric (i.e., total 
CPU allocation divided by total CPU requirements) dur- 
ing the average yield optimization phase. 
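
The difference between the two averages can be seen on a small made-up example. The function names and the numbers are illustrative assumptions only: one job is a single task with CPU need 0.8 that receives 0.4, and another job is split into four tasks with CPU need 0.2 each, all fully served.

```python
def per_task_average(jobs):
    """jobs: list of (per-task CPU need, [CPU allocation of each task])."""
    yields = [a / need for need, allocs in jobs for a in allocs]
    return sum(yields) / len(yields)


def per_job_average(jobs):
    """Per-job yield: total allocation divided by total requirement."""
    yields = [sum(allocs) / (len(allocs) * need) for need, allocs in jobs]
    return sum(yields) / len(yields)


jobs = [
    (0.8, [0.4]),                  # one large task at yield 0.5
    (0.2, [0.2, 0.2, 0.2, 0.2]),   # four small tasks, each at yield 1.0
]
```

Here the per-task average is 0.9 because the split job is counted four times, while the per-job average is 0.75; using the latter removes the incentive to split a job purely to weigh more in the average.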

The linear programming formulation with these additional
considerations and constraints is very similar to
that derived in Section 2.5. We again consider jobs 1..J
and hosts 1..H, but now each job i consists of $T_i$ tasks.
Since these jobs are constrained to be uniform, $\alpha_i$
represents the maximum CPU consumption and $m_i$ represents
the maximum memory consumption of all tasks
k of job i. The integer variables $e_{ikj}$ are constrained to
be either 0 or 1 and represent the absence or presence
of task k of job i on host j. The variables $\alpha_{ikj}$
represent the amount of CPU allocated to task k of job i on
host j.



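\begin{align}
\forall i,k,j \quad & e_{ikj} \in \mathbb{N}, \tag{12}\\
\forall i,k,j \quad & \alpha_{ikj} \in \mathbb{Q}, \tag{13}\\
\forall i,k,j \quad & 0 \leq e_{ikj} \leq 1, \tag{14}\\
\forall i,k,j \quad & 0 \leq \alpha_{ikj} \leq e_{ikj}, \tag{15}\\
\forall i,k \quad & \sum_{j=1}^{H} e_{ikj} = 1, \tag{16}\\
\forall j \quad & \sum_{i=1}^{J} \sum_{k=1}^{T_i} \alpha_{ikj} \leq 1, \tag{17}\\
\forall j \quad & \sum_{i=1}^{J} \sum_{k=1}^{T_i} e_{ikj}\, m_i \leq 1, \tag{18}\\
\forall i,k \quad & \ldots \tag{19}\\
\forall i,k,k' \quad & \sum_{j=1}^{H} \alpha_{ikj} = \sum_{j=1}^{H} \alpha_{ik'j}, \tag{20}\\
\forall i \quad & \sum_{j=1}^{H} \sum_{k=1}^{T_i} \frac{\alpha_{ikj}}{T_i \times \alpha_i} \geq Y. \tag{21}
\end{align}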



Note that the final constraint is logically equivalent to 
the per-task yield since all tasks are constrained to have 

Figure 19. Minimum Yield vs. Slack for
large problem instances for parallel jobs.

Figure 20. Average Yield vs. Slack for large
problem instances for parallel jobs.



the same CPU allocation. The reason for writing it this 
way is to highlight that in the second phase of optimiza- 
tion one should maximize the average per-job yield rather 
than the average per-task yield. 

5.2. Results 

The algorithms described in Section 3 for the case of 
sequential jobs can be used directly for minimum yield 
maximization for parallel jobs. The only major differ- 
ence is that the average per-task yield optimization phase 
needs to be changed for an average per-job optimiza- 
tion phase. As with the per-task optimization, we make 
the simplifying assumption that task placement decisions 
cannot be changed during this phase of the optimiza- 
tion. This simplification removes not only the difficulty 
of solving a MILP, but also allows us to avoid the enor- 
mous number of additional constraints which would be 
required to make sure that all of a given job's tasks re- 
ceive the same allocation while keeping the problem lin- 
ear. 

We present results only for large problem instances 
as defined in Section 4.1. We use the same experimental
methodology as defined there as well. We only need a
way to decide how many tasks comprise a parallel job. To 
this end, we use the parallel workload model proposed 
in [30], which models many characteristics of parallel 
workloads (derived based on statistical analysis of real- 
world batch system workloads). The model for the num- 
ber of tasks in a parallel job uses a two-stage log-uniform 
distribution biased towards powers of two. We instanti- 
ate this model using the same parameters as in [ ], as- 
suming that jobs can consist of between 1 and 64 tasks. 

Figure 19 shows results for the SG and the MCB8 al- 
gorithms. We exclude all other greedy algorithms as they 

Figure 21. Failure Rate vs. Slack for large
problem instances for parallel jobs.



were all shown to be outperformed by SG, all other MCB
algorithms because they were all shown to be outperformed
by MCB8, as well as the RRND and RRNZ algorithms
which were shown to perform poorly. The figure
also shows the upper bound on optimal obtained assuming
that the $e_{ij}$ variables can take rational values. We
see that MCB8 outperforms the SGB algorithm significantly
and is close to the upper bound on optimal for
slacks larger than 0.3.

Figure 20 shows the average job yield. We see the 
same phenomenon as in Figure 16, namely that the greedy 
algorithm can achieve higher average yield because it 
starts from a lower minimum yield, and thus has more op- 
tions to push the average yield higher (thereby improving 
average performance at the expense of fairness). 

Figure 22. Runtime vs. Number of Tasks
for large problem instances for parallel
jobs.



Figure 21 shows the failure rates of the MCB8 and SG
algorithms, which are identical. Finally, Figure 22 shows
the run time of both algorithms. We see that the SG
algorithm is much faster than the MCB8 algorithm (by
roughly a factor of 32 for 500 tasks). Nevertheless, MCB8
can still compute an allocation in under half a second
for 500 tasks.

Our conclusions are similar to the ones we made when
examining results for sequential jobs: in the case of parallel
jobs the MCB8 algorithm is the algorithm of choice
for optimizing minimum yield, while the SGB algorithm
could be an alternate choice if the number of tasks is very
large.



6. Dynamic Workloads 

In this section we study resource allocation in the 
case when assumption H4 no longer holds, meaning that 
the workload is no longer static. We assume that job re- 
source requirements can change and that jobs can join 
and leave the system. When the workload changes, one 
may wish to adapt the schedule to reach a new (nearly) 
optimal allocation of resources to the jobs. This adap- 
tation can entail two types of actions: (i) modifying the 
CPU fractions allocated to some jobs; and (ii) migrat- 
ing jobs to different physical hosts. In what follows we 
extend the linear program formulation derived in Sec- 
tion 2.5 to account for resource allocation adaptation. We 
then discuss how current technology can be used to im- 
plement adaptation with virtual clusters. 



6.1. Mixed-Integer Linear Program Formulation

One difficult question for resource allocation adap- 
tation, regardless of the context, is whether the adapta- 
tion is "worth it." Indeed, adaptation often comes with an 
overhead, and this overhead may lead to a loss of perfor- 
mance. In the case of virtual cluster scheduling, the over- 
head is due to VM migrations. The question of whether 
adaptation is worthwhile is often based on a time hori- 
zon (e.g., adaptation is not worthwhile if the workload 
is expected to change significantly in the next 5 min- 
utes) [ 1 , 45]. In virtual cluster scheduling, as defined 
in this paper, jobs do not have time horizons. Therefore, 
in principle, the scheduler cannot reason about when re- 
source needs will change. It may be possible for the 
scheduler to keep track of past workload behavior to fore- 
cast future workload behavior. Statistical workload mod- 
els have been built (see [29, 30] for models and litera- 
ture reviews). Techniques to make predictions based on 
historical information have been developed (see [ . ] for 
task execution time models and a good literature review). 
Making sound short-term decisions for resource alloca- 
tion adaptation requires highly accurate predictions, so 
as to carry out precise cost-benefit analyses of various 
adaptation paths. Unfortunately, accurate point predic- 
tions (rather than statistical characterizations) are elusive 
due to the inherently statistical and transient nature of 
the workload, as seen in the aforementioned works. Fur- 
thermore, most results in this area are obtained for batch 
scheduling environments with parallel scientific appli- 
cations, and it is not clear whether the obtained mod- 
els would be applicable in more general settings (e.g., 
cloud computing environments hosting internet services). 

Faced with the above challenge, rather than attempting
arduous statistical forecasting of adaptation cost and
pay-off, we side-step the issue and propose a pragmatic
approach. We consider schedule adaptation that attempts
to achieve the best possible yield, but so that job migrations
do not entail moving more than some fixed number
of bytes, B (e.g., to limit the amount of network load
due to schedule adaptation). If B is set to 0, then the
adaptation will do the best it can without using migration
whatsoever. If B is above the sum of the job sizes (in
bytes of memory requirement), then all jobs could be
migrated.

It turns out that this adaptation scheme can be easily 
formulated as a mixed-integer linear program. More gen- 
erally, the value of B can be chosen so that it achieves 
a reasonable trade-off between overhead and workload 
dynamicity. Choosing the best value for B for a partic- 
ular system could however be difficult and may need 
to be adaptive as most workloads are non-stationary. A
good approach is likely to pick relatively smaller values
of B for more dynamic workloads. We leave a study
of how to best tune parameter B for future work.

We use the same notations and definitions as in Section 2.5.
In addition, we consider that some jobs are already
assigned to a host: $\bar{e}_{ij}$ is equal to 1 if job i is
already running on host j, and 0 otherwise. For reasons
that will be clear after we explain our constraints, we
simply set $\bar{e}_{ij}$ to 1 for all j if job i corresponds
to a newly arrived job. Newly departed jobs need not be
taken into account. We can now write a new set of
constraints as follows:
The objective, as in Section 2.5, is to maximize Y.
The only new constraint is the last one. This constraint
simply states that if job i is assigned to a host that is different
from the host to which it was assigned previously,
then it needs to be migrated. Therefore, a number of bytes
equal to its memory requirement needs to be transferred.
These bytes are summed over all jobs in the system to ensure
that the total number of bytes communicated for migration
purposes does not exceed B.
Note that this is still a linear program as Eij is not a variable
but a constant. Since for newly arrived jobs we set all
Eij values to 1, we can see that they do not contribute to
the migration cost. Note that removing the memory requirement
term from the last constraint would simply mean that B is
a bound on the total number of job migrations allowed
during schedule adaptation.
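
The migration constraint described in the paragraph above can be sketched in LaTeX as follows. This is a reconstruction from the prose only, not the original formula: the symbols $e_{ij}$ (a binary decision variable equal to 1 if job i is placed on host j) and $m_i$ (the memory requirement of job i, in bytes) are names assumed here for quantities defined in Section 2.5, and $E_{ij}$ is the constant describing the previous placement.

```latex
% Assumed notation: e_{ij} in {0,1} is the new placement (variable),
% E_{ij} in {0,1} is the previous placement (constant),
% m_i is the memory requirement of job i in bytes.
\[
  \sum_{i} \sum_{j} \left(1 - E_{ij}\right) e_{ij}\, m_i \;\le\; B
\]
% A job kept on its previous host has E_{ij} = 1 where e_{ij} = 1: it adds 0.
% A job moved to a different host has E_{ij} = 0 where e_{ij} = 1: it adds m_i.
% A newly arrived job has E_{ij} = 1 for all j: it adds 0.
% Dropping m_i turns the left-hand side into the number of migrated jobs.
```

Because each $E_{ij}$ is a constant, every term on the left-hand side is linear in the variables $e_{ij}$.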

We leave the development of heuristic algorithms for 
solving the above linear program for future work. 

6.2. Technology Issues for Resource Allocation 
Adaptation 

In the linear program in the previous section nowhere 
do we account for the time it takes to migrate a job. While 
a job is being migrated it is presumably non-responsive, 
which impacts the yield. However, modern VM moni- 
tors support "live migration" of VM instances, which 
allows migrations with only milliseconds of unresponsiveness
[i j]. There could be a performance degradation
due to memory pages being migrated between two physical
hosts. Resource allocation adaptation also requires
quick modification of the CPU share allocated to a VM in- 
stance (assumption H5). We validate this assumption in 
Section 7 and find that, indeed, CPU shares can be modi- 
fied accurately in under a second. 

7. Evaluation of the Xen Hypervisor

Assumption H5 in Section 2.2 states that VM tech- 
nology allows for precise, low-overhead, and quickly 
adaptable sharing of the computational capabilities of 
a host across CPU-bound VM instances. Although this 
seems like a natural expectation, we nevertheless vali- 
date this assumption with state-of-the-art virtualization 
technology, namely, the Xen VM monitor [ ]. While virtualization
can happen inside the operating system (e.g.,
Virtual PC [ ], VMware [ ]), Xen runs between the
hardware and the operating system. It thus requires ei- 
ther a modified operating system ("paravirtualization") or 
hardware support for virtualization ("hardware virtualiza- 
tion" [ ]). In this work we use Xen 3.1 on a dual-CPU 
64-bit machine with paravirtualization. All our VM in- 
stances use identical 64-bit Fedora images, are allocated 
700MB of RAM, and run on the same physical CPU. The 
other CPU is used to run the experiment controller. All 
our VM instances perform continuous CPU-bound computations,
that is, 100 x 100 double-precision matrix multiplications
using the LAPACK DGEMM routine [5].

Our experiments consist of running from one to ten
VM instances with specified "cap values", which Xen
uses to control what fraction of the CPU is allocated to
each VM. We measure the effective compute rate of each
VM instance (in number of matrix multiplications per
second). We compare this rate to the expected rate, that
is, the cap value times the compute rate measured on
the raw hardware. We can thus ascertain both the accuracy
and the overhead of CPU-sharing in Xen. We
also conduct experiments in which we change cap val- 
ues on-the-fly and measure the delay before the effective 
compute rates are in agreement with the new cap val- 
ues. 
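
The comparison between effective and expected compute rates can be sketched as follows. This is a minimal illustration, not the measurement code used in the experiments: the function names are hypothetical, the cap is assumed to be expressed as a fraction in (0, 1], the error is assumed to be reported relative to the expected rate, and the sample rates are made-up numbers.

```python
def expected_rate(cap: float, raw_rate: float) -> float:
    """Expected compute rate: cap value (fraction of the CPU) times the raw-hardware rate."""
    if not 0.0 < cap <= 1.0:
        raise ValueError("cap must be a fraction in (0, 1]")
    return cap * raw_rate

def error_pct(effective: float, cap: float, raw_rate: float) -> float:
    """Gap between effective and expected rate, as a percentage of the expected rate."""
    expected = expected_rate(cap, raw_rate)
    return abs(effective - expected) / expected * 100.0

def overhead_pct(uncapped_vm_rate: float, raw_rate: float) -> float:
    """Slowdown of a single uncapped VM relative to the raw hardware, in percent."""
    return (raw_rate - uncapped_vm_rate) / raw_rate * 100.0

# Hypothetical measurements, in matrix multiplications per second.
raw = 1000.0
print(expected_rate(0.25, raw))      # rate a VM capped at 25% should reach
print(error_pct(245.0, 0.25, raw))   # accuracy of the cap for that VM
print(overhead_pct(997.0, raw))      # virtualization overhead
```

Accuracy is judged per capped VM instance, whereas overhead is judged from a VM that is given the whole CPU.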

Due to space limitations we only provide highlights 
of our results and refer the reader to a technical report 
for full details [ ]. We found that Xen imposes a mini- 
mal overhead (on average a 0.27% slowdown). We also 
found that the absolute error between the effective com- 
pute rate and the expected compute rate was at most 
5.99% and on average 0.72%. In terms of responsive- 
ness, we found that the effective compute rate of a VM 
becomes congruent with a cap value less than one sec- 
ond after that cap value was changed. We conclude that, 
in the case of CPU-bound VM instances, CPU-sharing in
Xen is sufficiently accurate and responsive to enable frac- 
tional and dynamic resource allocations as defined in this 
paper. 

8. Related Work 

The use of virtual machine technology to improve par- 
allel job reliability, cluster utilization, and power effi- 
ciency is a hot topic for research and development, with 
groups at several universities and in industry actively developing
resource management systems [2, 3, 19, 32, 42].
This paper builds on top of such research, using the re- 
source manager intelligently to optimize a user-centric 
metric that attempts to capture common ideas about fair- 
ness among the users of high-performance systems. 

Our work bears some similarities with gang schedul- 
ing. However, traditional gang scheduling approaches 
suffer from problems due to memory pressure and the 
communication expense of coordinating context switches 
across multiple hosts [9, 38]. By explicitly considering 
task memory requirements when making scheduling de- 
cisions and using virtual machine technology to multi- 
plex the CPU resources of individual hosts our approach 
avoids these problems. 

Above all, our approach is novel in that we define and 
optimize for a user-centric metric which captures both 
fairness and performance in the face of unknown time 
horizons and fluctuating resource needs. Our approach 
has the additional advantage of allowing for interactive 
job processes. 

9. Limitations and Future Directions

In this work we have made two key assumptions. The
first assumption is that VM instances are CPU-bound
(assumption H1), which made it possible to validate
assumption H5 in Section 7. However, in reality, VM
instances may have composite needs that span multiple
resources, including the network, the disk, and the memory
bus. The second assumption is that resource needs
are known (assumption H2). However, this typically does
not hold true in practice as users do not know the precise
resource needs of their applications. When assumption H1
does not hold, the challenge is to model composite
resource needs in the definition of the resource allocation
problem, and to share these various resources among VM
instances in practice.

In practice, CPU and network resources are strongly 
dependent within a virtual machine monitor environment. 
To ensure secure isolation, VM monitors interpose on
network communication, adding CPU overhead as a result.
Experience has shown that, because of this depen-
dence, one can capture network needs in terms of addi- 
tional CPU need [20]. Therefore, it should be straight- 
forward to modify our approach to account for network 
resource usage. In terms of disk usage, we note that vir- 
tual cluster environments typically use network-attached 
storage to simplify VM migration. As a result, disk us- 
age is subsumed in network usage. In both cases one 
should then be able to both model and precisely share 
network and disk usage. Much more challenging is the 
modeling and sharing of the memory bus usage, due to 
complex and deep memory hierarchies on multi-core pro- 
cessors. However, current work on Virtual Private Ma- 
chines points to effective ways for achieving sharing and 
performance isolation among VM instances of microar- 
chitecture resources [37], including the memory hierar- 
chy [ t)]. 

In terms of discovering VM instance resource needs, 
a first approach is to use standard services for tracking 
VM resource usage across a cluster and collecting the in- 
formation as input into a cluster system scheduler (e.g., 
the XenMon VM monitoring facility in Xen [ ]). Application
resource needs inside a VM instance can be dis-
covered via a combination of introspection and configu- 
ration variation. With introspection, for example, one can 
deduce application CPU needs by inferring process activ- 
ity inside of VMs [47], and memory pressure by inferring 
memory page eviction activity [ ]. This kind of moni- 
toring and inference provides one set of data points for 
a given system configuration. By varying the configura- 
tion of the system, one can then vary the amount of re- 
sources given to applications in VMs, track how they 
respond to the addition or removal of resources, and 
infer resource needs. Experience with such techniques 
in isolation has shown that they can be surprisingly ac- 
curate [47,48]. Furthermore, modeling resource needs 
across a range of configurations with high accuracy is less 
important than discovering where in that range the appli- 
cation experiences an inflection point (e.g., cannot make 
use of further CPU or memory) [17, 40].
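
The idea of locating an inflection point by configuration variation can be sketched as follows. This is a hypothetical illustration and not a technique taken from the cited work: the function name, the (allocation, throughput) sample format, and the 5% gain threshold are all assumptions. The sketch returns the smallest tested allocation beyond which adding resources no longer yields a meaningful throughput gain.

```python
from typing import List, Tuple

def find_inflection(samples: List[Tuple[float, float]],
                    min_gain: float = 0.05) -> float:
    """Return the allocation after which throughput stops improving.

    samples  -- (allocation, throughput) pairs observed while varying the
                resources given to a VM (e.g., CPU fraction)
    min_gain -- smallest relative throughput gain between two consecutive
                allocations that still counts as an improvement
    """
    if not samples:
        raise ValueError("at least one sample is required")
    ordered = sorted(samples)  # sort by increasing allocation
    for (alloc0, thr0), (_alloc1, thr1) in zip(ordered, ordered[1:]):
        gain = (thr1 - thr0) / thr0 if thr0 > 0 else float("inf")
        if gain < min_gain:
            return alloc0  # extra resources past this point are not used
    return ordered[-1][0]  # throughput was still improving at the largest allocation

# Hypothetical observations: throughput flattens once the VM gets 60% of a CPU.
observations = [(0.2, 20.0), (0.4, 40.0), (0.6, 58.0), (0.8, 59.0), (1.0, 59.0)]
print(find_inflection(observations))
```

Only the location of the knee matters for the allocator, so coarse measurements at a handful of configurations may be enough under these assumptions.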

10. Conclusion 

In this paper we have proposed a novel approach for 
allocating resources among competing jobs, relying on 
Virtual Machine technology and on the optimization of 
a well-defined metric that captures notions of perfor- 
mance and of fairness. We have given a formal defini- 
tion of a base problem, have proposed several algorithms 
to solve it, and have evaluated these algorithms in simula- 
tion. We have identified a promising algorithm that runs 
quickly, is on par with or better than its competitors, and 
is close to optimal in terms of our objective function. We
have then discussed several extensions to our approach to 
solve more general problems, namely when jobs are par- 
allel, when the workload is dynamic, when job resource 
needs are composite, and when job resource needs are un- 
known. 

Future directions include the development of algo- 
rithms to solve the resource allocation adaptation prob- 
lem, and of strategies for estimating job resource needs 
accurately. Our ultimate goal is to develop a new resource 
allocator as part of the Usher system [ ], so that our al- 
gorithms and techniques can be used as part of a practical 
system and evaluated in practical settings. 